{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "4-reuters.ipynb",
      "version": "0.3.2",
      "views": {},
      "default_view": {},
      "provenance": [],
      "collapsed_sections": []
    }
  },
  "cells": [
    {
      "metadata": {
        "id": "j1bsrXvzxBbl",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Classifying newswires with tf.keras, tf.data, and eager execution.\n",
        "\n",
        "In this lab, you'll learn how to classify newswires from the [Reuters dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/reuters) into categories. This tutorial was inspired by [this one](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/3.6-classifying-newswires.ipynb). This notebook is similar to previous one. The principal difference is here, we will explore a multiclass text classification problem.\n"
      ]
    },
    {
      "metadata": {
        "id": "fGBZG7rvwf2C",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "!pip install -q -U tensorflow==1.8.0\n",
        "import tensorflow as tf\n",
        "\n",
        "# Enable eager execution\n",
        "tf.enable_eager_execution()\n",
        "\n",
        "import numpy as np"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "_39NLiiGy9ey",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "## A first look at the dataset\n",
        "\n",
        "### Step 1) Downloading the dataset\n",
        "\n",
        "The dataset consists of a set of short [newswires](https://www.google.com/search?q=what+is+a+newswire) and their corresponding topics as published by Reuters in 1986."
      ]
    },
    {
      "metadata": {
        "id": "F-DuuRzay9KG",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.reuters.load_data(num_words=10000)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "arCrVEMGzR7t",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 2) Exploring the dataset \n",
        "\n",
        "There are around 46 different topics. The set is divided into 8982 training examples and 2246 test examples. Each training example is a list of words represented as integers similar to the IMDB dataset, and the labels are integers up to 46."
      ]
    },
    {
      "metadata": {
        "id": "KL6sMDLVzNpF",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "print(train_data[0])\n",
        "print(train_labels[0])\n",
        "print(len(train_data))\n",
        "print(len(test_data))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "T9DdCQVQRlJQ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Converting the integers to words\n",
        "\n",
        "We can get the dictionary from `tf.keras.datasets.reuters.get_word_index` which hashes the words to their corresponding integers. Let's try and convert a newswire from integers back into it's original text by first reversing this dictionary and then iterating over a newswire and converting the integers to strings."
      ]
    },
    {
      "metadata": {
        "id": "nVdKXJ7PRmUb",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Dictionary that hashes words to their integer\n",
        "word_to_integer = tf.keras.datasets.reuters.get_word_index()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "b8iedZIuRoIG",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "print(list(word_to_integer.keys())[0:10])\n",
        "\n",
        "integer_to_word = dict([(value, key) for (key, value) in word_to_integer.items()])\n",
        "\n",
        "# Demonstrate how to find the word from an integer\n",
        "print(integer_to_word[1])\n",
        "print(integer_to_word[2])\n",
        "\n",
        "import random\n",
        "\n",
        "random_index = random.randint(0, 100)\n",
        "# We need to subtract 3 from the indices because 0 is \"padding\", 1 is \"start of sequence\", and 2 is \"unknown\"\n",
        "decoded_newswire = ' '.join([integer_to_word.get(i - 3, 'UNK') for i in train_data[random_index]])\n",
        "print(decoded_newswire)\n",
        "print(train_labels[random_index])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "NZYEERu2R0y1",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 3) Format the data\n",
        "As before, we are going to multi-hot encode our newswire which is an array of integers into a 10,000 dimensional vector. We will place 1's in the indices of word-integers that occur in the newswire, and 0's for everything else."
      ]
    },
    {
      "metadata": {
        "id": "nbTtNYCIR4hs",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "def vectorize_sequences(sequences, dimension=10000):\n",
        "    # Create an all-zero matrix of shape (len(sequences), dimension)\n",
        "    results = np.zeros((len(sequences), dimension), dtype=np.float32)\n",
        "    for i, sequence in enumerate(sequences):\n",
        "        results[i, sequence] = 1.  # set specific indices of results[i] to 1s\n",
        "    return results\n",
        "\n",
        "\n",
        "train_data = vectorize_sequences(train_data)\n",
        "test_data = vectorize_sequences(test_data)\n",
        "\n",
        "print(train_data.shape)\n",
        "print(train_data[0])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "MKo1m1ZrSELb",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 4) Format the labels\n",
        "\n",
        "We will also use [tf.keras.utils.to_categorical](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical) to one hot encode our labels."
      ]
    },
    {
      "metadata": {
        "id": "b5Ew7zv-SJK-",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "LABEL_DIMENSIONS = 46\n",
        "\n",
        "print(train_labels[0]) # Before\n",
        "train_labels  = tf.keras.utils.to_categorical(train_labels, LABEL_DIMENSIONS)\n",
        "print(train_labels[0]) # After\n",
        "\n",
        "test_labels = tf.keras.utils.to_categorical(test_labels, LABEL_DIMENSIONS)\n",
        "\n",
        "# Needed later\n",
        "train_labels = train_labels.astype(np.float32)\n",
        "test_labels = test_labels.astype(np.float32)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "mupgnRsQSa6k",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 5) Create the model\n",
        "\n",
        "Our model is similar to the previous notebook, modified to work for a multiclass classification problem."
      ]
    },
    {
      "metadata": {
        "id": "VLsDpdIoSaIU",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "### Create a model\n",
        "model = tf.keras.Sequential()\n",
        "\n",
        "model.add(tf.keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(10000,)))\n",
        "model.add(tf.keras.layers.Dense(64, activation=tf.nn.relu))\n",
        "model.add(tf.keras.layers.Dense(LABEL_DIMENSIONS, activation=tf.nn.softmax))\n",
        "\n",
        "optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001)\n",
        "\n",
        "model.compile(loss='categorical_crossentropy',\n",
        "              optimizer=optimizer,\n",
        "              metrics=['accuracy'])\n",
        "\n",
        "model.summary()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "E0KQorztSW0n",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 6) Validation set\n",
        "\n",
        "As before, we want to test our model on data it hasn't seen before, but before we get our final test accuracy. Here, we'll create a validation set."
      ]
    },
    {
      "metadata": {
        "id": "Ci6jFHB9TbWA",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "VAL_SIZE = 1000\n",
        "\n",
        "val_data = train_data[:VAL_SIZE]\n",
        "partial_train_data = train_data[VAL_SIZE:]\n",
        "\n",
        "\n",
        "val_labels = train_labels[:VAL_SIZE]\n",
        "partial_train_labels = train_labels[VAL_SIZE:]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "CaNEaxWkTkAE",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 7) Create a tf.data Dataset"
      ]
    },
    {
      "metadata": {
        "id": "Zx7aLE4vTnsS",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "BATCH_SIZE = 512\n",
        "TRAINING_SIZE = partial_train_labels.shape[0]\n",
        "\n",
        "training_set = tf.data.Dataset.from_tensor_slices((partial_train_data, partial_train_labels))\n",
        "training_set = training_set.shuffle(TRAINING_SIZE).batch(BATCH_SIZE)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "W8S-IkYMTsHm",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 8) Training your model\n",
        "Please be patient as this step may take a while."
      ]
    },
    {
      "metadata": {
        "id": "3OuvepxiTu7m",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "EPOCHS = 20\n",
        "\n",
        "# Store list of metric values for plotting later\n",
        "tr_loss_list = []\n",
        "tr_accuracy_list = []\n",
        "val_loss_list = []\n",
        "val_accuracy_list = []\n",
        "\n",
        "for epoch in range(EPOCHS):\n",
        "  for newswires, labels in training_set:\n",
        "    # Calculate training loss and accuracy\n",
        "    tr_loss, tr_accuracy = model.train_on_batch(newswires, labels)\n",
        "  \n",
        "  # Calculate validation loss and accuracy\n",
        "  val_loss, val_accuracy = model.evaluate(val_data, val_labels)\n",
        "\n",
        "  # Add to the lists\n",
        "  tr_loss_list.append(tr_loss)\n",
        "  tr_accuracy_list.append(tr_accuracy)\n",
        "  val_loss_list.append(val_loss)\n",
        "  val_accuracy_list.append(val_accuracy)\n",
        "  \n",
        "  print(('Epoch #%d\\t Training Loss: %.2f\\tTraining Accuracy: %.2f\\t'\n",
        "         'Validation Loss: %.2f\\tValidation Accuracy: %.2f')\n",
        "         % (epoch + 1, tr_loss, tr_accuracy,\n",
        "         val_loss, val_accuracy))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "TR0thBUhUDTb",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 9) Plotting your loss and accuracy\n",
        "We are going to use `matplotlib` to plot our training and validation metrics."
      ]
    },
    {
      "metadata": {
        "id": "FEekeJKbTqj-",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "import matplotlib.pyplot as plt\n",
        "\n",
        "epochs = range(1, EPOCHS + 1)\n",
        "\n",
        "# \"bo\" specifies \"blue dot\"\n",
        "plt.plot(epochs, tr_loss_list, 'bo', label='Training loss')\n",
        "# b specifies a \"solid blue line\"\n",
        "plt.plot(epochs, val_loss_list, 'b', label='Validation loss')\n",
        "plt.title('Training and validation loss')\n",
        "plt.xlabel('Epochs')\n",
        "plt.ylabel('Loss')\n",
        "plt.legend()\n",
        "\n",
        "plt.show()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "XWybdzYEUE7p",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "plt.clf()   # Clear plot\n",
        "\n",
        "plt.plot(epochs, tr_accuracy_list, 'bo', label='Training accuracy')\n",
        "plt.plot(epochs, val_accuracy_list, 'b', label='Validation accuracy')\n",
        "plt.title('Training and validation accuracy')\n",
        "plt.xlabel('Epochs')\n",
        "plt.ylabel('Accuracy')\n",
        "plt.legend()\n",
        "\n",
        "plt.show()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "0UdrazksUI0_",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Step 10) Testing your model\n",
        "Now that we have successfully trained our model and our training accuracy has jumped over 90%, we need to test it. The test accuracy is a better evaluation metric for how our model will perform in the real world."
      ]
    },
    {
      "metadata": {
        "id": "ta7jDHRwUGIa",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "loss, accuracy = model.evaluate(test_data, test_labels)\n",
        "print('Test accuracy: %.2f' % (accuracy))"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}